Metadata extraction and text categorization using Universal Resource Locator expansions
Author
Abstract
Uniform resource locators (URLs), which mark the address of a resource on the World Wide Web, are often human-readable and can indicate metadata about a resource. This paper explores the mining of URLs to yield categoric metadata about web resources via a three-phase pipeline of word segmentation, abbreviation expansion and classification. I apply this approach to the problem of subject metadata generation and quantify its performance relative to title- and document-based methods, both of which require the retrieval of the source document.
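The three phases named in the abstract (word segmentation, abbreviation expansion, classification) can be illustrated with a minimal Python sketch. This is not the paper's implementation: the vocabulary, the abbreviation table, the category keyword sets, the greedy longest-match segmenter, and all function names are hypothetical stand-ins chosen to keep the example small.

```python
import re
from urllib.parse import urlparse

# Hypothetical toy resources (assumptions, not taken from the paper).
VOCAB = {"computer", "science", "course", "notes", "music", "home", "page"}
ABBREVIATIONS = {"cs": "computer science", "hp": "home page"}
CATEGORY_KEYWORDS = {
    "computing": {"computer", "science"},
    "arts": {"music"},
}


def tokenize_url(url):
    """Split host and path into lowercase alphabetic chunks."""
    parsed = urlparse(url)
    raw = (parsed.netloc + parsed.path).lower()
    return [chunk for chunk in re.split(r"[^a-z]+", raw) if chunk]


def segment(chunk):
    """Phase 1: greedy longest-match segmentation against VOCAB.

    Returns [chunk] unchanged if the chunk cannot be fully segmented.
    """
    words, i = [], 0
    while i < len(chunk):
        for j in range(len(chunk), i, -1):
            if chunk[i:j] in VOCAB:
                words.append(chunk[i:j])
                i = j
                break
        else:
            return [chunk]
    return words


def expand(token):
    """Phase 2: replace a known abbreviation with its expansion."""
    return ABBREVIATIONS.get(token, token).split()


def classify(tokens):
    """Phase 3: pick the category sharing the most keywords, or None."""
    token_set = set(tokens)
    scores = {c: len(kw & token_set) for c, kw in CATEGORY_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def url_to_category(url):
    """Run all three phases using only the URL string."""
    tokens = []
    for chunk in tokenize_url(url):
        for piece in segment(chunk):
            tokens.extend(expand(piece))
    return tokens, classify(tokens)


tokens, category = url_to_category("http://www.example.edu/cs/coursenotes.html")
print(tokens, category)
```

The sketch never fetches the document behind the URL, which is the property the abstract contrasts with title- and document-based methods. A keyword-overlap vote stands in for a trained classifier here only for brevity.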
Similar resources
Joint Web-Feature (JFEAT): A Novel Web Page Classification Framework
With the increasing number of web pages on the internet, it has become a major concern to obtain information accurately at a reasonable cost with decent performance. A potential solution is the classification of web pages into meaningful categories. An effective classification of web pages benefits various applications such as web mining and search engines. Unlik...
Categorizing Learning Objects Based On Wikipedia as Substitute Corpus
As metadata is often not sufficiently provided by authors of Learning Resources, automatic metadata generation methods are used to create metadata afterwards. One kind of metadata is categorization, particularly the partition of Learning Resources into distinct subject categories. A disadvantage of state-of-the-art categorization methods is that they require corpora of sample Learning Resources...
Exploring Multidimensional Continuous Feature Space to Extract Relevant Words
With growing amounts of text data, descriptive metadata become more crucial to processing it efficiently. One kind of such metadata is keywords, which we encounter, e.g., in everyday browsing of webpages. Such metadata can be of benefit in various scenarios, such as web search or content-based recommendation. We research the keyword extraction problem from the perspective of vector space and ...
Practical Issues for Automated Categorization of Web Sites
In this paper we discuss several issues related to automated text classification of web sites. We analyze the nature of web content and metadata and requirements for text features. We present an approach for targeted spidering including metadata extraction and opportunistic crawling of specific semantic hyperlinks. We describe a system for automatically classifying web sites into industry categ...